This file (hdat9600_final_assignment.Rmd) is the R Markdown document in which you need to complete your HDAT9600 final assignment. This assignment is assessed and will count for 30% of the total course marks. The assignment comprises two tasks worth 15 marks each. The first task will focus on logistic regression, and the second task will focus on survival analysis. There is no word limit, but a report of about 10 pages in length when printed (except that it will not be printed) is appropriate.
Don’t hesitate to ask the course convenor for help via OpenLearning. The course instructor are happy to point you in the right direction and to make suggestions, but they won’t, of course, complete your assignments for you!
The data used for this assignment consist of records from Intensive Care Unit (ICU) hospital stays in the USA. All patients were adults who were admitted for a wide variety of reasons. ICU stays of less than 48 hours have been excluded.
The source data for the assignment are data made freely available for the 2012 MIT PhysioNet/Computing for Cardiology Challenge. Details are provided here. Training Set A data have been used. The original data has been modified and assembled to suit the purpose of this assignment. While not required for the purposes of this assignment, full details of the preparatory work can be found in the hdat9600-final-assignment-data-preparation file.
The dataframe consists of 120 variables, which are defined as follows:
Use the hyperlinks below to find out more about the clinical meaning of each variable. The first two clinical variables are summary scores that are used to assess patient condition and risk.
The following 36 clinical measures were assessed at multiple timepoints during each patient’s ICU stay. For each of the 36 clinical measures, you are given 3 summary variables: a) The minimum value during the first 24 hours in ICU (_min), b) The maximum value during the first 24 hours in ICU (_max), and c) The difference between the mean and the most extreme values during the first 24 hours in ICU (_diff). For example, for the clinical measure Cholesterol, these three variables are labelled ‘Cholesterol_min’, ‘Cholesterol_max’, and ‘Cholesterol_diff’.
The data frame can be loaded with the following code:
icu_patients_df0 <- readRDS("icu_patients_df0.rds")
icu_patients_df1 <- readRDS("icu_patients_df1.rds")
Note: icu_patients_df1 is an imputed (i.e. missing values are ‘derived’) version of icu_patients_df0. This assignment does not concern the methods used for imputation.
In this task, you are required to develop a logistic regression model using the icu_patients_df1 data set which adequately explains or predicts the in_hospital_death variable as the outcome using a subset of the available predictor variables. You should fit a series of models, evaluating each one, before you present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge. It is perfectly acceptable to include predictor variables in your final model which are not statistically significant, as long as you justify their inclusion on medical or physiological grounds (you will not be marked down if your medical justification is not exactly correct or complete, but do you best). Aim for between five and ten predictor variables (slightly more or fewer is OK). You should assess each model you consider for goodness of fit and other relevant statistics to help you choose between them. For your final model, present a set of diagnostic statistics and/or charts and comment on them. You don’t need to do an exhaustive exploratory data analysis of all the variables in the data set, but you should examine those variables that you use in your model. Finally, re-fit your final model to the unimputed data frame (icu_patients_df0.rds) and comment on any differences you find compared to the same model fitted to the imputed data.
Select an initial subset of explanatory variables that you will use to predict the risk of in-hospital death. Justify your choice.
Conduct basic exploratory data analysis on your variables of choice.
Fit appropriate univariate logistic regression models.
Fit an appropriate series of multivariable logistic regression models, justifying your approach. Assess each model you consider for goodness of fit and other relevant statistics.
Present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge.
For your final model, present a set of diagnostic statistics and/or charts and comment on them.
Write a paragraph summarising the most important findings of your final model. Include the most important values from the statistical output, and a simple clinical interpretation.
Create your response to this task here, as a mixture of embedded (knitr) R code and any resulting outputs, and explanatory or commentary text.
Regarding which variables to include in the model selection process we decided to first include the basic patient descriptor variables age and gender, as it is a good idea to inspect these regardless of the sort of their suspected significance. Furthermore, the type of ICU and length of stay were also included, as were both the SAPS-I score and the SOFA score. We hypothesized the ICU type may be an indicator of the severity of the patient’s condition, and thus a reasonable predictor of mortality. The SOFA AND SAPS-I scores are risk assessment measures, ergo they are an obvious choice to include for further inspection.
We also included the minimum white blood cell count. A low white blood cell count is indicative of a weak immune system, a person with no white blood cells is incapable of fighting infection, and thus is more likely to die from a variety of conditions than someone with a sufficient amount of white blood cells. The maximum creatine level was included as this is a common measure of renal function. Fi02 was chosen as a candidate variable over Pa02 because you can titrate Fi02 to correct low Pa02 until later stages of disease. Maximum lactate was examined as it is often elevated in states of hypoperfusion. Extremes at the high and low ends of the pH scale often warrant a poor prognosis, thus both considered. Tropinon, a cardiac enzyme, is elevated in myocardial ischaemia. Finally, temperature and glucose levels, both at a maximum and minimum, were explored.
# Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(DescTools)
library(faraway)
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.11-2, built: 2020-7-27)
## Working directory is /Users/joshbryden/Google Drive/UNSW/2020-T2/HDAT9600/HDAT9600_Group_Project
##
## Attaching package: 'arm'
## The following objects are masked from 'package:faraway':
##
## fround, logit, pfround
library(ROCR)
# Reading in the data sets
icu_patients_df0 <- readRDS("icu_patients_df0.rds")
icu_patients_df1 <- readRDS("icu_patients_df1.rds")
From the EDA below for patient descriptor variables it can be observed that the oldest patient in the data set was 90, the youngest 16, with the median age being 67. On average, the women admitted to the ICU were slightly older than the men, 65.5 years old compared to 63.5. However, by frequency there were more male patients (1148) than female patients (913).
# EDA for the patient descriptor variables
# -------------------- General summary statistics by variable ----------------#
summary(icu_patients_df1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 52.00 67.00 64.41 78.00 90.00
table(icu_patients_df1$Gender)
##
## Female Male
## 913 1148
# ---------------------- Age by gender ------------------------------------- #
# Plot
ggplot(data = icu_patients_df1, aes(x = Age, fill = Gender)) + geom_histogram() +
facet_grid(Gender ~ .) + ggtitle("Distribution of age by gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Summary Statistics
group_by(icu_patients_df1, Gender) %>% summarize(mean_age = mean(Age))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Gender mean_age
## <fct> <dbl>
## 1 Female 65.5
## 2 Male 63.5
The clinical variables are SOFA and SAPS-I scores. SOFA stands for sequential organ failure assessment. It is a mortality prediction score based on the degree of dysfunction of six organ systems. The SAPS-I score is another mortality risk prediction measure that is calculated from 14 routine physiological measurements. Note that in the histograms below the vertical red line indicates the mean of the plotted measurement.
# EDA for clinical variables
# SOFA summary and plot
summary(icu_patients_df1$SOFA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000 3.000 6.000 6.441 9.000 22.000
ggplot(data = icu_patients_df1, aes(x = SOFA)) + geom_histogram(fill = "cornflowerblue") +
geom_vline(xintercept = mean(icu_patients_df1$SOFA), col = "red") +
ggtitle("Distribution of SOFA scores")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# SAPS1 summary and plot
summary(icu_patients_df1$SAPS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 11.00 15.00 14.96 19.00 34.00 96
ggplot(data = icu_patients_df1, aes(x = SAPS1)) + geom_histogram(fill = "cornflowerblue") +
geom_vline(xintercept = mean(na.exclude(icu_patients_df1$SAPS1)), col = "red") +
ggtitle("Distribution of SAPS1 scores")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 96 rows containing non-finite values (stat_bin).
Brief EDA for the remaining clinical variables is presented below. Similarly to the above plots, the vertical line represents the mean for that particular measure. The distributions vary in shape, but structurally everything appears to be in order with the data entry of the measurements.
# EDA for other variables
# ICU type
table(icu_patients_df1$ICUType)
##
## Coronary Care Unit Cardiac Surgery Recovery Unit
## 297 448
## Medical ICU Surgical ICU
## 788 528
# White blood cell (minimum) count
summary(icu_patients_df1$WBC_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 7.60 10.40 11.51 14.10 128.30
ggplot(data = icu_patients_df1, aes(x = WBC_min)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$WBC_min), col = "cornflowerblue") +
ggtitle("Distribution of minimum white blooc cell counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# WBC diff count
summary(icu_patients_df1$WBC_diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03315 2.63315 4.53315 5.82079 7.23315 143.46685
ggplot(data = icu_patients_df1, aes(x = WBC_diff)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$WBC_diff), col = "cornflowerblue") +
ggtitle("Distribution of white blood cell count differences between mean and most extreme value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Creatine (max)
summary(icu_patients_df1$Creatinine_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.200 0.800 1.000 1.499 1.500 22.000
ggplot(data = icu_patients_df1, aes(x = Creatinine_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Creatinine_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum creatinine measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# albumin (min)
summary(icu_patients_df1$Albumin_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.100 2.600 3.000 3.012 3.500 5.300
ggplot(data = icu_patients_df1, aes(x = Albumin_min)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Albumin_min), col = "cornflowerblue") +
ggtitle("Distribution of minimum albumin measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Fi02
summary(icu_patients_df1$FiO2_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2800 0.5000 1.0000 0.7874 1.0000 1.0000
ggplot(data = icu_patients_df1, aes(x = FiO2_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$FiO2_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum FiO2 measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# lactate (max)
summary(icu_patients_df1$Lactate_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.400 1.500 2.200 2.773 3.200 29.300
ggplot(data = icu_patients_df1, aes(x = Lactate_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Lactate_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum lactate measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# pH (max and min)
summary(icu_patients_df1$pH_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.150 7.380 7.420 7.418 7.460 7.690
ggplot(data = icu_patients_df1, aes(x = pH_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$pH_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum pH measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(icu_patients_df1$pH_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 7.280 7.340 7.327 7.390 7.630
ggplot(data = icu_patients_df1, aes(x = pH_min)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$pH_min), col = "cornflowerblue") +
ggtitle("Distribution of minimum pH measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Temperature(max and min)
summary(icu_patients_df1$Temp_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.40 37.10 37.60 37.69 38.20 42.10
ggplot(data = icu_patients_df1, aes(x = Temp_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Temp_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum temperature measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(icu_patients_df1$Temp_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.20 35.60 36.10 36.01 36.60 38.30
ggplot(data = icu_patients_df1, aes(x = Temp_min)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Temp_min), col = "cornflowerblue") +
ggtitle("Distribution of minimum temperature measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# MAP (min and max)
summary(icu_patients_df1$MAP_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 94.0 104.0 111.8 117.0 291.0
ggplot(data = icu_patients_df1, aes(x = MAP_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$MAP_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum MAP measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(icu_patients_df1$MAP_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 55.00 61.00 62.76 70.00 265.00
ggplot(data = icu_patients_df1, aes(x = MAP_min)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$MAP_min), col = "cornflowerblue") +
ggtitle("Distribution of minimum MAP measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Glusoce (max and min)
summary(icu_patients_df1$Glucose_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.0 117.0 141.0 163.3 180.0 1143.0
ggplot(data = icu_patients_df1, aes(x = Glucose_max)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Glucose_max), col = "cornflowerblue") +
ggtitle("Distribution of maximum glusoce measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(icu_patients_df1$Glucose_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 98.0 117.0 124.8 141.0 632.0
ggplot(data = icu_patients_df1, aes(x = Glucose_min)) + geom_histogram(fill = "salmon1") +
geom_vline(xintercept = mean(icu_patients_df1$Glucose_min), col = "cornflowerblue") +
ggtitle("Distribution of minimum glusoce measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, we fit each candidate variable as a univariate predictor of in hospital mortality. Out of the patient descriptor variables age was found to be significant while gender was not. Both clinical risk predictions scores were found to be significant, as was ICU type. Among the additional clinical measures we chose as candidate variables only FiO2 (max), pH (max), temperature (max), MAP (max and min), and glucose (min) were found to not be significant.
# Fitting univariate models for the predictors chosen so far
# Univariate fits
age_fit <- glm(in_hospital_death ~ Age, family = "binomial", data = icu_patients_df1)
gender_fit <- glm(in_hospital_death ~ Gender, family = "binomial", data = icu_patients_df1)
SAPS_fit <- glm(in_hospital_death ~ SAPS1, family = "binomial", data = icu_patients_df1)
SOFA_fit <- glm(in_hospital_death ~ SOFA, family = "binomial", data = icu_patients_df1)
WBCmin_fit <- glm(in_hospital_death ~ WBC_min, family = "binomial", data = icu_patients_df1)
WBCdiff_fit <- glm(in_hospital_death ~ WBC_diff, family = "binomial", data = icu_patients_df1)
creatine_fit <- glm(in_hospital_death ~ Creatinine_max, family = "binomial", data = icu_patients_df1)
albumin_fit <- glm(in_hospital_death ~ Albumin_min, family = "binomial", data = icu_patients_df1)
fio2_fit <- glm(in_hospital_death ~ FiO2_max, family = "binomial", data = icu_patients_df1)
lactate_fit <- glm(in_hospital_death ~ Lactate_max, family = "binomial", data = icu_patients_df1)
pHmax_fit <- glm(in_hospital_death ~ pH_max, family = "binomial", data = icu_patients_df1)
pHmin_fit <- glm(in_hospital_death ~ pH_min, family = "binomial", data = icu_patients_df1)
tempmax_fit <- glm(in_hospital_death ~ Temp_max, family = "binomial", data = icu_patients_df1)
tempmin_fit <- glm(in_hospital_death ~ Temp_min, family = "binomial", data = icu_patients_df1)
mapmax_fit <- glm(in_hospital_death ~ MAP_max, family = "binomial", data = icu_patients_df1)
mapmin_fit <- glm(in_hospital_death ~ MAP_min, family = "binomial", data = icu_patients_df1)
glucosemax_fit <- glm(in_hospital_death ~ Glucose_max, family = "binomial", data = icu_patients_df1)
glucosemin_fit <- glm(in_hospital_death ~ Glucose_min, family = "binomial", data = icu_patients_df1)
ICUtype_fit <- glm(in_hospital_death ~ ICUType, family = "binomial", data = icu_patients_df1)
# Summaries
summary(age_fit) # significant, AIC = 1648.9
##
## Call:
## glm(formula = in_hospital_death ~ Age, family = "binomial", data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7522 -0.6264 -0.5111 -0.3919 2.5135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.761624 0.303337 -12.401 < 2e-16 ***
## Age 0.029376 0.004229 6.947 3.73e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1644.9 on 2059 degrees of freedom
## AIC: 1648.9
##
## Number of Fisher Scoring iterations: 5
summary(gender_fit) # non significant
##
## Call:
## glm(formula = in_hospital_death ~ Gender, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5612 -0.5612 -0.5553 -0.5553 1.9728
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.76894 0.09381 -18.856 <2e-16 ***
## GenderMale -0.02281 0.12615 -0.181 0.856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1699.7 on 2059 degrees of freedom
## AIC: 1703.7
##
## Number of Fisher Scoring iterations: 4
summary(SAPS_fit) # significant, AIC = 1534.8
##
## Call:
## glm(formula = in_hospital_death ~ SAPS1, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3834 -0.5894 -0.4662 -0.3448 2.6278
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.79727 0.23888 -15.896 <2e-16 ***
## SAPS1 0.12558 0.01338 9.384 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1627.0 on 1964 degrees of freedom
## Residual deviance: 1530.8 on 1963 degrees of freedom
## (96 observations deleted due to missingness)
## AIC: 1534.8
##
## Number of Fisher Scoring iterations: 5
summary(SOFA_fit) # significant, AIC = 1611.5
##
## Call:
## glm(formula = in_hospital_death ~ SOFA, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0747 -0.5835 -0.4771 -0.3623 2.4609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.83453 0.14090 -20.117 <2e-16 ***
## SOFA 0.14378 0.01539 9.342 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1607.5 on 2059 degrees of freedom
## AIC: 1611.5
##
## Number of Fisher Scoring iterations: 5
summary(WBCmin_fit) # significant, AIC = 1699.6
##
## Call:
## glm(formula = in_hospital_death ~ WBC_min, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2861 -0.5638 -0.5477 -0.5315 2.0563
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.987167 0.119356 -16.649 <2e-16 ***
## WBC_min 0.017452 0.008437 2.068 0.0386 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1695.6 on 2059 degrees of freedom
## AIC: 1699.6
##
## Number of Fisher Scoring iterations: 4
summary(WBCdiff_fit) # significant, AIC = 1694.7 ,ONLY CHOOSE WBC_MIN OR WBC_DIFF TO AVOID MULTICOLLINEARITY
##
## Call:
## glm(formula = in_hospital_death ~ WBC_diff, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9566 -0.5607 -0.5420 -0.5273 2.0360
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.939861 0.084336 -23.002 < 2e-16 ***
## WBC_diff 0.025751 0.008809 2.923 0.00346 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1690.7 on 2059 degrees of freedom
## AIC: 1694.7
##
## Number of Fisher Scoring iterations: 4
summary(creatine_fit) # significant, AIC = 1678.4
##
## Call:
## glm(formula = in_hospital_death ~ Creatinine_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8627 -0.5433 -0.5270 -0.5151 2.0633
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.05087 0.08430 -24.328 < 2e-16 ***
## Creatinine_max 0.16325 0.03135 5.208 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1674.4 on 2059 degrees of freedom
## AIC: 1678.4
##
## Number of Fisher Scoring iterations: 4
summary(albumin_fit) # significant, AIC = 1680
##
## Call:
## glm(formula = in_hospital_death ~ Albumin_min, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7948 -0.5887 -0.5385 -0.4595 2.2842
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.36392 0.29389 -1.238 0.216
## Albumin_min -0.48186 0.09987 -4.825 1.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1676.0 on 2059 degrees of freedom
## AIC: 1680
##
## Number of Fisher Scoring iterations: 4
summary(fio2_fit) # not significant
##
## Call:
## glm(formula = in_hospital_death ~ FiO2_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5634 -0.5634 -0.5555 -0.5504 1.9898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8614 0.2102 -8.856 <2e-16 ***
## FiO2_max 0.1011 0.2533 0.399 0.69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1699.5 on 2059 degrees of freedom
## AIC: 1703.5
##
## Number of Fisher Scoring iterations: 4
summary(lactate_fit) # significant, AIC = 1673.5
##
## Call:
## glm(formula = in_hospital_death ~ Lactate_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1726 -0.5544 -0.5200 -0.4939 2.1212
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1932 0.1005 -21.820 < 2e-16 ***
## Lactate_max 0.1372 0.0244 5.625 1.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1669.5 on 2059 degrees of freedom
## AIC: 1673.5
##
## Number of Fisher Scoring iterations: 4
summary(pHmax_fit) # not significant
##
## Call:
## glm(formula = in_hospital_death ~ pH_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6684 -0.5677 -0.5523 -0.5297 2.0743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.3001 7.0197 1.325 0.185
## pH_max -1.4944 0.9469 -1.578 0.115
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1697.2 on 2059 degrees of freedom
## AIC: 1701.2
##
## Number of Fisher Scoring iterations: 4
summary(pHmin_fit) # significant, AIC = 1681.4
##
## Call:
## glm(formula = in_hospital_death ~ pH_min, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9980 -0.5733 -0.5358 -0.4868 2.2874
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 19.5912 4.8996 3.998 6.37e-05 ***
## pH_min -2.9197 0.6699 -4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1677.4 on 2059 degrees of freedom
## AIC: 1681.4
##
## Number of Fisher Scoring iterations: 4
summary(tempmax_fit) # not significant
##
## Call:
## glm(formula = in_hospital_death ~ Temp_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6077 -0.5689 -0.5549 -0.5366 2.1386
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.60419 3.08152 0.521 0.603
## Temp_max -0.08988 0.08183 -1.098 0.272
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1698.5 on 2059 degrees of freedom
## AIC: 1702.5
##
## Number of Fisher Scoring iterations: 4
summary(tempmin_fit) # significant, AIC = 1688.3
##
## Call:
## glm(formula = in_hospital_death ~ Temp_min, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8918 -0.5741 -0.5409 -0.4973 2.2040
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.4599 2.3473 3.178 0.00148 **
## Temp_min -0.2571 0.0654 -3.931 8.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1684.3 on 2059 degrees of freedom
## AIC: 1688.3
##
## Number of Fisher Scoring iterations: 4
summary(mapmax_fit) # not significant, AIC = 1703.7
##
## Call:
## glm(formula = in_hospital_death ~ MAP_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5583 -0.5579 -0.5579 -0.5578 1.9695
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.779817 0.215732 -8.250 <2e-16 ***
## MAP_max -0.000016 0.001846 -0.009 0.993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1699.7 on 2059 degrees of freedom
## AIC: 1703.7
##
## Number of Fisher Scoring iterations: 4
summary(mapmin_fit) # not significant
##
## Call:
## glm(formula = in_hospital_death ~ MAP_min, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6583 -0.5674 -0.5551 -0.5341 2.4214
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.413112 0.257434 -5.489 4.04e-08 ***
## MAP_min -0.005926 0.004051 -1.463 0.143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1697.4 on 2059 degrees of freedom
## AIC: 1701.4
##
## Number of Fisher Scoring iterations: 4
summary(glucosemax_fit) # significant, AIC = 1688.2
##
## Call:
## glm(formula = in_hospital_death ~ Glucose_max, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4117 -0.5572 -0.5343 -0.5162 2.0872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1865370 0.1202802 -18.179 < 2e-16 ***
## Glucose_max 0.0023817 0.0005819 4.093 4.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1684.2 on 2059 degrees of freedom
## AIC: 1688.2
##
## Number of Fisher Scoring iterations: 4
summary(glucosemin_fit) # not significant
##
## Call:
## glm(formula = in_hospital_death ~ Glucose_min, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7799 -0.5613 -0.5522 -0.5428 2.0271
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.967537 0.171241 -11.490 <2e-16 ***
## Glucose_min 0.001476 0.001253 1.178 0.239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1698.4 on 2059 degrees of freedom
## AIC: 1702.4
##
## Number of Fisher Scoring iterations: 4
summary(ICUtype_fit) # significant, AIC = 1663.3
##
## Call:
## glm(formula = in_hospital_death ~ ICUType, family = "binomial",
## data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6402 -0.6402 -0.5615 -0.3458 2.3861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6463 0.1576 -10.443 < 2e-16 ***
## ICUTypeCardiac Surgery Recovery Unit -1.1407 0.2563 -4.451 8.55e-06 ***
## ICUTypeMedical ICU 0.1653 0.1824 0.906 0.365
## ICUTypeSurgical ICU -0.1214 0.2001 -0.607 0.544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1699.7 on 2060 degrees of freedom
## Residual deviance: 1655.3 on 2057 degrees of freedom
## AIC: 1663.3
##
## Number of Fisher Scoring iterations: 5
From the univariate fits we chose to further examine a number of models. The first of which contained all predictors significant in the univariate analyses, as well as gender. From there, given that all candidate variables have been deemed either statistically and/or clinically significant, we employed backwards elimination to minimize AIC and find the optimal model. Judging by AIC, BIC, and other metric provided by the PseudoR2 function it can be observed that the model fit using backwards elimination is preferable. Thus, our final model included the predictors: age, SAPS1, SOFA, maximum lactate measurement, as well as ICU type. While this model may seem rather simple, it should be noted that the SOFA and SAPS1 scores are derived from many of the clinical measurements in the data set and thus their importance is captured in the background.
# Skeleton code for picking the next few models
# Candidate models from above variables
cand_1_fit <- glm(in_hospital_death ~ Age + Gender + SAPS1 + SOFA + WBC_diff + Creatinine_max + Albumin_min + Lactate_max +
pH_min + Temp_min + Glucose_max + ICUType, family = "binomial", data = icu_patients_df1)
cand_backelim_fit <- step(cand_1_fit, direction = "back")
## Start: AIC=1421.26
## in_hospital_death ~ Age + Gender + SAPS1 + SOFA + WBC_diff +
## Creatinine_max + Albumin_min + Lactate_max + pH_min + Temp_min +
## Glucose_max + ICUType
##
## Df Deviance AIC
## - Glucose_max 1 1391.3 1419.3
## - Gender 1 1391.4 1419.4
## - WBC_diff 1 1391.8 1419.8
## - Temp_min 1 1392.1 1420.1
## - Creatinine_max 1 1393.0 1421.0
## - pH_min 1 1393.0 1421.0
## - Lactate_max 1 1393.2 1421.2
## <none> 1391.3 1421.3
## - Albumin_min 1 1393.9 1421.9
## - SOFA 1 1402.9 1430.9
## - SAPS1 1 1403.8 1431.8
## - Age 1 1432.2 1460.2
## - ICUType 3 1459.7 1483.7
##
## Step: AIC=1419.28
## in_hospital_death ~ Age + Gender + SAPS1 + SOFA + WBC_diff +
## Creatinine_max + Albumin_min + Lactate_max + pH_min + Temp_min +
## ICUType
##
## Df Deviance AIC
## - Gender 1 1391.5 1417.5
## - WBC_diff 1 1391.8 1417.8
## - Temp_min 1 1392.2 1418.2
## - Creatinine_max 1 1393.0 1419.0
## - pH_min 1 1393.0 1419.0
## <none> 1391.3 1419.3
## - Lactate_max 1 1393.4 1419.4
## - Albumin_min 1 1393.9 1419.9
## - SOFA 1 1402.9 1428.9
## - SAPS1 1 1404.5 1430.5
## - Age 1 1432.2 1458.2
## - ICUType 3 1464.2 1486.2
##
## Step: AIC=1417.45
## in_hospital_death ~ Age + SAPS1 + SOFA + WBC_diff + Creatinine_max +
## Albumin_min + Lactate_max + pH_min + Temp_min + ICUType
##
## Df Deviance AIC
## - WBC_diff 1 1392.0 1416.0
## - Temp_min 1 1392.3 1416.3
## - pH_min 1 1393.2 1417.2
## - Creatinine_max 1 1393.3 1417.3
## <none> 1391.5 1417.5
## - Lactate_max 1 1393.6 1417.6
## - Albumin_min 1 1394.0 1418.0
## - SOFA 1 1403.4 1427.4
## - SAPS1 1 1404.5 1428.5
## - Age 1 1432.2 1456.2
## - ICUType 3 1464.2 1484.2
##
## Step: AIC=1415.97
## in_hospital_death ~ Age + SAPS1 + SOFA + Creatinine_max + Albumin_min +
## Lactate_max + pH_min + Temp_min + ICUType
##
## Df Deviance AIC
## - Temp_min 1 1392.9 1414.9
## - pH_min 1 1393.6 1415.6
## - Creatinine_max 1 1393.8 1415.8
## <none> 1392.0 1416.0
## - Lactate_max 1 1394.1 1416.1
## - Albumin_min 1 1394.4 1416.4
## - SOFA 1 1404.1 1426.1
## - SAPS1 1 1404.6 1426.6
## - Age 1 1433.4 1455.4
## - ICUType 3 1464.2 1482.2
##
## Step: AIC=1414.88
## in_hospital_death ~ Age + SAPS1 + SOFA + Creatinine_max + Albumin_min +
## Lactate_max + pH_min + ICUType
##
## Df Deviance AIC
## - pH_min 1 1394.6 1414.6
## - Creatinine_max 1 1394.7 1414.7
## <none> 1392.9 1414.9
## - Albumin_min 1 1395.4 1415.4
## - Lactate_max 1 1395.7 1415.7
## - SOFA 1 1404.8 1424.8
## - SAPS1 1 1406.9 1426.9
## - Age 1 1434.9 1454.9
## - ICUType 3 1464.3 1480.3
##
## Step: AIC=1414.62
## in_hospital_death ~ Age + SAPS1 + SOFA + Creatinine_max + Albumin_min +
## Lactate_max + ICUType
##
## Df Deviance AIC
## - Creatinine_max 1 1396.5 1414.5
## <none> 1394.6 1414.6
## - Albumin_min 1 1397.0 1415.0
## - Lactate_max 1 1398.2 1416.2
## - SOFA 1 1408.5 1426.5
## - SAPS1 1 1408.8 1426.8
## - Age 1 1435.9 1453.9
## - ICUType 3 1466.4 1480.4
##
## Step: AIC=1414.51
## in_hospital_death ~ Age + SAPS1 + SOFA + Albumin_min + Lactate_max +
## ICUType
##
## Df Deviance AIC
## - Albumin_min 1 1398.5 1414.5
## <none> 1396.5 1414.5
## - Lactate_max 1 1399.8 1415.8
## - SAPS1 1 1410.9 1426.9
## - SOFA 1 1413.8 1429.8
## - Age 1 1437.1 1453.1
## - ICUType 3 1476.3 1488.3
##
## Step: AIC=1414.47
## in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max + ICUType
##
## Df Deviance AIC
## <none> 1398.5 1414.5
## - Lactate_max 1 1402.0 1416.0
## - SAPS1 1 1412.9 1426.9
## - SOFA 1 1419.0 1433.0
## - Age 1 1439.2 1453.2
## - ICUType 3 1481.9 1491.9
# Summaries
summary(cand_1_fit)
##
## Call:
## glm(formula = in_hospital_death ~ Age + Gender + SAPS1 + SOFA +
## WBC_diff + Creatinine_max + Albumin_min + Lactate_max + pH_min +
## Temp_min + Glucose_max + ICUType, family = "binomial", data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4315 -0.5690 -0.3992 -0.2439 2.9331
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7125053 4.2517991 0.403 0.687116
## Age 0.0293566 0.0048130 6.099 1.06e-09 ***
## GenderMale 0.0578426 0.1420371 0.407 0.683835
## SAPS1 0.0716758 0.0203541 3.521 0.000429 ***
## SOFA 0.0794779 0.0235266 3.378 0.000730 ***
## WBC_diff -0.0075925 0.0112084 -0.677 0.498157
## Creatinine_max 0.0553524 0.0413612 1.338 0.180809
## Albumin_min -0.1895045 0.1165692 -1.626 0.104017
## Lactate_max 0.0436155 0.0312626 1.395 0.162976
## pH_min -0.5661247 0.4512019 -1.255 0.209586
## Temp_min -0.0701843 0.0760048 -0.923 0.355789
## Glucose_max 0.0001117 0.0007467 0.150 0.881065
## ICUTypeCardiac Surgery Recovery Unit -1.5891191 0.2862601 -5.551 2.84e-08 ***
## ICUTypeMedical ICU 0.1845113 0.2057159 0.897 0.369760
## ICUTypeSurgical ICU -0.0859675 0.2251218 -0.382 0.702557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1627.0 on 1964 degrees of freedom
## Residual deviance: 1391.3 on 1950 degrees of freedom
## (96 observations deleted due to missingness)
## AIC: 1421.3
##
## Number of Fisher Scoring iterations: 5
summary(cand_backelim_fit)
##
## Call:
## glm(formula = in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max +
## ICUType, family = "binomial", data = icu_patients_df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4353 -0.5710 -0.4000 -0.2492 2.9116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.570809 0.430858 -12.930 < 2e-16 ***
## Age 0.028988 0.004763 6.087 1.15e-09 ***
## SAPS1 0.071969 0.019024 3.783 0.000155 ***
## SOFA 0.098617 0.022035 4.475 7.63e-06 ***
## Lactate_max 0.055024 0.029396 1.872 0.061228 .
## ICUTypeCardiac Surgery Recovery Unit -1.634703 0.275855 -5.926 3.11e-09 ***
## ICUTypeMedical ICU 0.200727 0.202790 0.990 0.322259
## ICUTypeSurgical ICU -0.102964 0.220491 -0.467 0.640518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1627.0 on 1964 degrees of freedom
## Residual deviance: 1398.5 on 1957 degrees of freedom
## (96 observations deleted due to missingness)
## AIC: 1414.5
##
## Number of Fisher Scoring iterations: 5
# Comparing the two
PseudoR2(cand_1_fit, which = "all")
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.1449129 0.1264745 0.1130701 0.2008057 0.1071343
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.2365221 0.1390483 0.2642414 0.1372283 1421.2600331
## BIC logLik logLik0 G2
## 1505.0087460 -695.6300166 -813.5195273 235.7790214
PseudoR2(cand_backelim_fit, which = "all")
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.1404791 0.1306452 0.1098079 0.1950123 0.1041979
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.2300393 0.1356312 0.2593508 0.1328953 1414.4741522
## BIC logLik logLik0 G2
## 1459.1401324 -699.2370761 -813.5195273 228.5649023
When diagnosing the quality of the model we first examined the binned residuals plot. Initially, slight alarm was raised over the residuals on the leftmost side of the bottom half of the plot. However, the residuals seem to broadly be acceptable. The halfnorm plot indicated two values of concern. Upon further investigation these patients had normal measurements for the variables used in our final model. Given the large size of the data these two patients should not be of much concern.
# Renaming final model
final_model <- cand_backelim_fit
# Residual checks
binnedplot(predict(final_model), residuals(final_model))
# Outlier checks
halfnorm(hatvalues(final_model))
# Inspecting suspect values
icu_patients_df1[hatvalues(final_model) > 0.04, c("in_hospital_death", "Age", "ICUType", "SOFA", "SAPS1")]
## in_hospital_death Age ICUType SOFA SAPS1
## 162 1 71 Medical ICU -1 13
## 1597 0 77 Medical ICU 12 24
# ROC curve
#pred_prob = predict(final_model, type = "response")
#label_vec <- na.exclude(icu_patients_df1$in_hospital_death)
#pred <- prediction(pred_prob, label_vec, label.ordering = NULL)
#perf <- performance(pred, measure = "tpr", x.measure = "fpr")
#plot(perf, col = rainbow(10))
When interpreting the model it is best to default to the odds, thus, we first exponentiated the repression coefficients. The interpretation of such odds are as follows: the odds of dying in the ICU are 2.9% greater for every 1 year increase in age, 7.4% greater for every unit increase in SAPS1 score, 10.3% higher for every unit increase in SOFA score, 5.6% greater for every unit increase in maximum lactate measurement, 80.5% lower if you are in the surgical recovery unit compared to baseline (coronary care unit), 22.2% higher if you are in the medical ICU compared to baseline, and about 10% lower if you are in the surgical ICU compared to baseline.
final_coef <- coef(final_model)
exp(final_coef)
## (Intercept) Age
## 0.0038074 1.0294126
## SAPS1 SOFA
## 1.0746216 1.1036440
## Lactate_max ICUTypeCardiac Surgery Recovery Unit
## 1.0565663 0.1950102
## ICUTypeMedical ICU ICUTypeSurgical ICU
## 1.2222905 0.9021597
After fitting the model the the unimputed data set we can see that the AIC is actually much lower, and more variables are significant at the 0.05 level.
# Fitting final model to second ICU data set
final_model_newdata <- glm(in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max + ICUType, data = icu_patients_df0)
summary(final_model_newdata)
##
## Call:
## glm(formula = in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max +
## ICUType, data = icu_patients_df0)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.55141 -0.20650 -0.10595 0.01153 1.12886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2354064 0.0648056 -3.633 0.000296 ***
## Age 0.0027118 0.0007015 3.866 0.000118 ***
## SAPS1 0.0083246 0.0030865 2.697 0.007117 **
## SOFA 0.0154750 0.0036109 4.286 2.01e-05 ***
## Lactate_max 0.0166539 0.0048162 3.458 0.000568 ***
## ICUTypeCardiac Surgery Recovery Unit -0.2382688 0.0451132 -5.282 1.59e-07 ***
## ICUTypeMedical ICU -0.0268378 0.0408673 -0.657 0.511528
## ICUTypeSurgical ICU -0.0901556 0.0422844 -2.132 0.033251 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1220829)
##
## Null deviance: 136.76 on 963 degrees of freedom
## Residual deviance: 116.71 on 956 degrees of freedom
## (1097 observations deleted due to missingness)
## AIC: 718.34
##
## Number of Fisher Scoring iterations: 2
In this task, you are required to develop a Cox proportional hazards survival model using the icu_patients_df1 data set which adequately explains or predicts the length of survival indicated by the Days variable, with censoring as indicated by the Status variable. You should fit a series of models, maybe three or four, evaluating each one, before you present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge. Aim for between five and ten predictor variables (slightly more or fewer is OK). It is perfectly acceptable to include predictor variables in your final model which are not statistically significant, as long as you justify their inclusion on medical or physiological grounds (you will not be marked down if your medical justification is not exactly correct, but do you best). You should assess each model you consider for goodness of fit and other relevant statistics, and you should assess your final model for violations of assumptions and perform other diagnostics which you think are relevant (and modify the model if indicated, or at least comment on the possible impact of what your diagnostics show). Finally, re-fit your final model to the unimputed data frame (icu_patients_df0.rds) and comment on any differences you find.
Select an initial subset of explanatory variables that you will use to predict survival. Justify your choice.
Conduct basic exploratory data analysis on your variables of choice.
Fit appropriate univariate Cox proportional hazards models.
Fit an appropriate series of multivariable Cox proportional hazards models, justifying your approach. Assess each model you consider for goodness of fit and other relevant statistics.
Present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge.
For your final model, present a set of diagnostic statistics and/or charts and comment on them.
Write a very brief paragraph summarising the most important findings of your final model. Include the most important values from the statistical output, and a simple clinical interpretation.
Create your response to this task here, as a mixture of embedded (knitr) R code and any resulting outputs, and explanatory or commentary text.
# basic EDA in task2
# load the survival library
library(survival)
##
## Attaching package: 'survival'
## The following objects are masked from 'package:faraway':
##
## rats, solder
# read the dataset
head(icu_patients_df1)
## RecordID Length_of_stay SAPS1 SOFA Survival in_hospital_death Days Status Age
## 1 132539 5 6 1 NA 0 2408 FALSE 54
## 2 132540 8 16 8 NA 0 2408 FALSE 76
## 3 132541 19 21 11 NA 0 2408 FALSE 44
## 4 132543 9 7 1 575 0 575 TRUE 68
## 5 132545 4 17 2 918 0 918 TRUE 88
## 6 132547 6 14 11 1637 0 1637 TRUE 64
## Albumin_diff Albumin_max Albumin_min ALP_diff ALP_max ALP_min ALT_diff
## 1 0.2186633 3.2 3.1 118.147964 214 202 80.44617
## 2 0.8813367 2.1 2.2 252.147964 338 348 94.44617
## 3 0.6813367 2.7 2.3 31.147964 127 105 45.44617
## 4 1.4186633 4.4 4.4 9.147964 105 105 108.44617
## 5 0.3813367 2.7 2.6 56.852036 39 78 96.44617
## 6 0.4186633 3.4 3.3 5.147964 101 101 75.44617
## ALT_max ALT_min AST_diff AST_max AST_min Bilirubin_diff Bilirubin_max
## 1 40 75 131.35271 38 53 1.464039 0.4
## 2 206 26 116.35271 53 74 1.564039 1.2
## 3 91 75 65.64729 235 164 1.235961 3.0
## 4 12 12 154.35271 15 15 1.564039 0.2
## 5 24 32 154.35271 15 97 1.364039 0.4
## 6 60 45 122.35271 162 47 1.364039 0.4
## Bilirubin_min BUN_diff BUN_max BUN_min Cholesterol_diff Cholesterol_max
## 1 0.3 11.527053 13 13 16.42276 154
## 2 0.2 8.527053 18 16 28.42276 139
## 3 2.8 21.527053 8 3 56.42276 111
## 4 0.2 4.527053 23 20 37.42276 127
## 5 0.9 20.472947 45 45 55.42276 104
## 6 0.4 9.527053 19 15 55.57724 212
## Cholesterol_min Creatinine_diff Creatinine_max Creatinine_min DiasABP_diff
## 1 140 0.4324463 0.8 0.8 NA
## 2 128 0.4324463 1.2 0.8 26.54421
## 3 100 0.9324463 0.4 0.3 NA
## 4 119 0.5324463 0.9 0.7 NA
## 5 101 0.2324463 1.0 1.0 NA
## 6 212 0.3324463 1.4 0.9 20.45579
## DiasABP_max DiasABP_min FiO2_diff FiO2_max FiO2_min GCS_diff GCS_max GCS_min
## 1 NA NA 0.05192012 0.5 0.5 3.755971 15 15
## 2 81 32 0.44807988 1.0 0.4 8.244029 15 3
## 3 NA NA 0.44807988 1.0 0.5 6.244029 8 5
## 4 NA NA 0.44807988 1.0 0.4 3.755971 15 14
## 5 NA NA 0.15192012 0.4 0.5 3.755971 15 15
## 6 79 55 0.05192012 0.5 0.5 4.244029 9 7
## Gender Glucose_diff Glucose_max Glucose_min HCO3_diff HCO3_max HCO3_min
## 1 Female 65.14446 205 205 3.227452 26 26
## 2 Male 34.85554 105 105 1.772548 22 21
## 3 Female 20.85554 141 119 3.227452 26 24
## 4 Male 33.85554 129 106 5.227452 28 27
## 5 Female 26.85554 113 113 4.772548 18 18
## 6 Male 124.14446 264 197 3.772548 19 19
## HCT_diff HCT_max HCT_min Height HR_diff HR_max HR_min
## 1 2.739871 33.7 33.5 NA 29.077891 80 58
## 2 6.260129 29.7 24.7 175.3 7.077891 88 80
## 3 4.260129 28.5 26.7 NA 30.077891 113 57
## 4 10.339871 41.3 36.1 180.3 30.077891 88 57
## 5 8.360129 30.8 22.6 NA 20.077891 94 67
## 6 10.639871 41.6 36.8 180.3 16.077891 91 71
## ICUType K_diff K_max K_min Lactate_diff Lactate_max
## 1 Surgical ICU 0.2647934 4.4 4.4 0.9964037 1.9
## 2 Cardiac Surgery Recovery Unit 0.1647934 4.3 4.3 1.4964037 2.9
## 3 Medical ICU 4.4647934 8.6 3.3 1.4964037 1.9
## 4 Medical ICU 0.1352066 4.2 4.0 1.5964037 1.2
## 5 Medical ICU 1.8647934 6.0 3.8 0.8964037 2.0
## 6 Coronary Care Unit 0.9647934 5.1 3.8 1.8964037 0.9
## Lactate_min MAP_diff MAP_max MAP_min Mg_diff Mg_max Mg_min Na_diff Na_max
## 1 1.8 31.23164 109 56 0.4842982 1.5 1.5 2.2066071 137
## 2 1.3 34.76836 100 43 1.1157018 3.1 1.9 0.2066071 139
## 3 1.3 53.23164 131 71 0.6842982 1.9 1.3 2.2066071 140
## 4 1.5 24.23164 102 72 0.1157018 2.1 2.1 1.7933929 141
## 5 1.9 9.76836 78 68 0.4842982 1.5 1.5 0.7933929 140
## 6 1.3 24.23164 102 62 0.2842982 1.7 1.7 2.2066071 141
## Na_min NIDiasABP_diff NIDiasABP_max NIDiasABP_min NIMAP_diff NIMAP_max
## 1 137 17.49101 65 40 17.04069 92.33
## 2 139 19.49101 65 38 26.38069 86.33
## 3 137 37.50899 95 66 34.28931 110.00
## 4 140 23.50899 81 54 24.98931 100.70
## 5 140 38.50899 96 29 29.98931 105.70
## 6 137 31.50899 89 52 26.58931 102.30
## NIMAP_min NISysABP_diff NISysABP_max NISysABP_min PaCO2_diff PaCO2_max
## 1 58.67 40.30125 157 96 3.335797 37
## 2 49.33 44.69875 129 72 7.335797 41
## 3 83.33 33.30125 150 111 3.335797 37
## 4 73.00 23.30125 140 102 9.335797 38
## 5 63.67 39.30125 156 119 6.335797 34
## 6 61.67 35.69875 129 81 5.335797 45
## PaCO2_min PaO2_diff PaO2_max PaO2_min pH_diff pH_max pH_min Platelets_diff
## 1 38 47.61789 186 111 0.12011376 7.49 7.43 31.23069
## 2 33 286.38211 445 89 0.08011376 7.45 7.34 36.23069
## 3 37 93.61789 65 65 0.14011376 7.51 7.51 117.76931
## 4 31 94.61789 148 64 0.14011376 7.51 7.47 201.23069
## 5 35 80.61789 78 84 0.04011376 7.38 7.41 80.76931
## 6 35 80.61789 101 78 0.07988624 7.40 7.29 86.23069
## Platelets_max Platelets_min RespRate_diff RespRate_max RespRate_min SaO2_diff
## 1 221 221 7.34858 24 12 3.246079
## 2 226 164 16.65142 36 11 1.753921
## 3 84 72 13.65142 33 18 2.246079
## 4 391 315 7.34858 21 12 1.753921
## 5 109 109 6.65142 26 15 3.246079
## 6 276 219 27.65142 47 20 1.246079
## SaO2_max SaO2_min SysABP_diff SysABP_max SysABP_min Temp_diff Temp_max
## 1 98 94 NA NA NA 1.874083 38.1
## 2 99 97 50.3105 135 66 2.474083 37.9
## 3 95 95 NA NA NA 2.025917 39.0
## 4 99 97 NA NA NA 1.874083 36.7
## 5 97 94 NA NA NA 1.174083 37.8
## 6 97 96 43.3105 152 73 1.174083 37.8
## Temp_min TroponinI_diff TroponinI_max TroponinI_min TroponinT_diff
## 1 35.1 5.1429448 1.0 0.3 0.4785006
## 2 34.5 26.2570552 31.7 16.1 0.6485006
## 3 36.7 31.2570552 33.4 36.7 0.8814994
## 4 35.1 0.8570552 5.9 6.3 0.6485006
## 5 35.8 0.1570552 5.6 5.6 0.6085006
## 6 35.8 4.1429448 1.3 1.3 0.6385006
## TroponinT_max TroponinT_min Urine_diff Urine_max Urine_min WBC_diff WBC_max
## 1 0.58 0.19 800.78242 900 30 0.9331524 11.2
## 2 0.43 0.02 670.78242 770 0 4.7331524 13.1
## 3 1.55 1.41 310.78242 410 30 8.4331524 4.2
## 4 0.10 0.02 600.78242 700 100 3.3331524 11.5
## 5 0.06 0.37 83.21758 150 16 8.3331524 3.8
## 6 0.03 0.10 1100.78242 1200 40 11.8668476 24.0
## WBC_min Weight_diff Weight_max Weight_min
## 1 11.2 NA NA NA
## 2 7.4 4.699878 80.6 76.0
## 3 3.7 23.999878 56.7 56.7
## 4 8.8 3.900122 84.6 84.6
## 5 3.8 NA NA NA
## 6 14.4 33.300122 114.0 114.0
nrow(icu_patients_df1)
## [1] 2061
# Note: status = TRUE = event occurred, status = FALSE = no event
# summary the number of days
summary(icu_patients_df1$Days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 265 2408 1634 2408 2408
hist(icu_patients_df1$Days)
# check the status
table(icu_patients_df1$Status)
##
## FALSE TRUE
## 1288 773
# check age and gender
summary(icu_patients_df1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 52.00 67.00 64.41 78.00 90.00
summary(icu_patients_df1$Gender)
## Female Male
## 913 1148
The following variables of interest were chosen for the following reasons: Gender was chosen as we wanted to investigate whether this had an effect on patient outcomes alongside the ICU Type and Age. These variables were simple to test for and if significant, can provide some indication as to underlying causes of why death/ survival was the outcome. WBC variables were chosen as WBC counts can infer the presence of infection / chronic disease, as such we felt that this was an important set of variables to test for. Furthermore glucose levels were chosen as potentially significant variables to test for as they too can infer both underlying physiological/ renal system issues, or the presence of chronic disease (diabetes). Heart rate and respiratory rates were also chosen as variables of interest as these measures have well defined homeostatic ranges, which in turn, if out of this normal range, can infer the presence of an underlying physiological cause that could alter these values. Finally both the SAPS1 and SOFA variables were chosen as they represent an amalgamation of clinical measures that attempt to address sequential organ failure (SOFA) and a collection of physiological scores (SAPS1). As both these variables address many different clinical and physiological measures in one score, we felt it key to include these as variables of interest.
# ----------------------------------------Gender----------------------------#
summary(icu_patients_df1$Gender)
## Female Male
## 913 1148
# ----------------------------------------ICUType----------------------------#
summary(icu_patients_df1$ICUType)
## Coronary Care Unit Cardiac Surgery Recovery Unit
## 297 448
## Medical ICU Surgical ICU
## 788 528
# ----------------------------------------Age----------------------------#
summary(icu_patients_df1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 52.00 67.00 64.41 78.00 90.00
ggplot(data = icu_patients_df1, aes(x = Age)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$Age), col="blue")+
ggtitle("Age histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------WBC_diff----------------------------#
summary(icu_patients_df1$WBC_diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03315 2.63315 4.53315 5.82079 7.23315 143.46685
ggplot(data = icu_patients_df1, aes(x = WBC_diff)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$WBC_diff), col="blue")+
ggtitle("WBC_diff histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------WBC_min----------------------------#
summary(icu_patients_df1$WBC_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 7.60 10.40 11.51 14.10 128.30
ggplot(data = icu_patients_df1, aes(x = WBC_min)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$WBC_min), col="blue")+
ggtitle("WBC_min histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------WBC_max----------------------------#
summary(icu_patients_df1$WBC_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 9.30 12.30 13.95 16.90 155.60
ggplot(data = icu_patients_df1, aes(x = WBC_max)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$WBC_max), col="blue")+
ggtitle("WBC_max histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------Glucose_diff----------------------------#
summary(icu_patients_df1$Glucose_diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1445 23.8555 39.1445 57.0844 61.8555 1003.1445
ggplot(data = icu_patients_df1, aes(x = Glucose_diff)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$Glucose_diff), col="blue")+
ggtitle("Glucose_diff histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------Glucose_min----------------------------#
summary(icu_patients_df1$Glucose_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 98.0 117.0 124.8 141.0 632.0
ggplot(data = icu_patients_df1, aes(x = Glucose_min)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$Glucose_min), col="blue")+
ggtitle("Glucose_min histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------Glucose_max----------------------------#
summary(icu_patients_df1$Glucose_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.0 117.0 141.0 163.3 180.0 1143.0
ggplot(data = icu_patients_df1, aes(x = Glucose_max)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$Glucose_max), col="blue")+
ggtitle("Glucose_max histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ----------------------------------------SAPS1--------------------------------------#
summary(icu_patients_df1$SAPS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 11.00 15.00 14.96 19.00 34.00 96
ggplot(data = icu_patients_df1, aes(x = SAPS1)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$SAPS1), col="blue")+
ggtitle("SAPS1 histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 96 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_vline).
# ----------------------------------------SOFA---------------------------------------#
summary(icu_patients_df1$SOFA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000 3.000 6.000 6.441 9.000 22.000
ggplot(data = icu_patients_df1, aes(x = SOFA)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$SOFA), col="blue")+
ggtitle("SOFA histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# --------------------------------------HR_max---------------------------------------#
summary(icu_patients_df1$HR_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44.0 91.0 104.0 106.6 119.0 300.0
ggplot(data = icu_patients_df1, aes(x = HR_max)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$HR_max), col="blue")+
ggtitle("HR_max histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# ---------------------------------------RespRate_max----------------------------------#
summary(icu_patients_df1$RespRate_max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 24.00 27.00 29.12 33.00 98.00
ggplot(data = icu_patients_df1, aes(x = RespRate_max)) + geom_histogram(fill = "red")+
geom_vline(xintercept = mean(icu_patients_df1$RespRate_max), col="blue")+
ggtitle("RespRate_max histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# remove all the NA values from the dataset
icu_patients_df1 <- na.omit(icu_patients_df1)
# fitting a simple Cox model with gender using coxph
cox.gender <- coxph( Surv(Days, Status) ~ Gender, data=icu_patients_df1)
summary(cox.gender)
## Call:
## coxph(formula = Surv(Days, Status) ~ Gender, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## GenderMale 0.1183 1.1255 0.1477 0.801 0.423
##
## exp(coef) exp(-coef) lower .95 upper .95
## GenderMale 1.126 0.8885 0.8427 1.503
##
## Concordance= 0.511 (se = 0.021 )
## Likelihood ratio test= 0.64 on 1 df, p=0.4
## Wald test = 0.64 on 1 df, p=0.4
## Score (logrank) test = 0.64 on 1 df, p=0.4
# fitting a simple Cox model with ICUType using coxph
cox.ICUType <- coxph( Surv(Days, Status) ~ ICUType, data=icu_patients_df1)
summary(cox.ICUType)
## Call:
## coxph(formula = Surv(Days, Status) ~ ICUType, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ICUTypeCardiac Surgery Recovery Unit -0.1845 0.8315 0.2275 -0.811 0.417
## ICUTypeMedical ICU 0.2261 1.2538 0.2132 1.061 0.289
## ICUTypeSurgical ICU 0.3238 1.3823 0.2061 1.571 0.116
##
## exp(coef) exp(-coef) lower .95 upper .95
## ICUTypeCardiac Surgery Recovery Unit 0.8315 1.2026 0.5323 1.299
## ICUTypeMedical ICU 1.2538 0.7976 0.8256 1.904
## ICUTypeSurgical ICU 1.3823 0.7234 0.9230 2.070
##
## Concordance= 0.555 (se = 0.022 )
## Likelihood ratio test= 7.16 on 3 df, p=0.07
## Wald test = 6.95 on 3 df, p=0.07
## Score (logrank) test = 7.05 on 3 df, p=0.07
# fitting a simple Cox model with Age using coxph
cox.Age <- coxph( Surv(Days, Status) ~ Age, data=icu_patients_df1)
summary(cox.Age)
## Call:
## coxph(formula = Surv(Days, Status) ~ Age, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Age 0.001164 1.001165 0.005857 0.199 0.842
##
## exp(coef) exp(-coef) lower .95 upper .95
## Age 1.001 0.9988 0.9897 1.013
##
## Concordance= 0.518 (se = 0.025 )
## Likelihood ratio test= 0.04 on 1 df, p=0.8
## Wald test = 0.04 on 1 df, p=0.8
## Score (logrank) test = 0.04 on 1 df, p=0.8
# fitting a simple Cox model with WBC_max using coxph
cox.WBC_max <- coxph( Surv(Days, Status) ~ WBC_max, data=icu_patients_df1)
summary(cox.WBC_max) # significant
## Call:
## coxph(formula = Surv(Days, Status) ~ WBC_max, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## WBC_max 0.018929 1.019109 0.006873 2.754 0.00588 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## WBC_max 1.019 0.9812 1.005 1.033
##
## Concordance= 0.548 (se = 0.025 )
## Likelihood ratio test= 6.43 on 1 df, p=0.01
## Wald test = 7.59 on 1 df, p=0.006
## Score (logrank) test = 7.64 on 1 df, p=0.006
# fitting a simple Cox model with WBC_diff using coxph
cox.WBC_diff <- coxph( Surv(Days, Status) ~ WBC_diff, data=icu_patients_df1)
summary(cox.WBC_diff)
## Call:
## coxph(formula = Surv(Days, Status) ~ WBC_diff, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## WBC_diff 0.017377 1.017529 0.008527 2.038 0.0416 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## WBC_diff 1.018 0.9828 1.001 1.035
##
## Concordance= 0.531 (se = 0.026 )
## Likelihood ratio test= 3.51 on 1 df, p=0.06
## Wald test = 4.15 on 1 df, p=0.04
## Score (logrank) test = 4.21 on 1 df, p=0.04
# fitting a simple Cox model with Glucose_max using coxph
cox.Glucose_max <- coxph( Surv(Days, Status) ~ Glucose_max, data=icu_patients_df1)
summary(cox.Glucose_max) # significant
## Call:
## coxph(formula = Surv(Days, Status) ~ Glucose_max, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Glucose_max 0.0021497 1.0021520 0.0007545 2.849 0.00438 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Glucose_max 1.002 0.9979 1.001 1.004
##
## Concordance= 0.563 (se = 0.025 )
## Likelihood ratio test= 6.48 on 1 df, p=0.01
## Wald test = 8.12 on 1 df, p=0.004
## Score (logrank) test = 8.14 on 1 df, p=0.004
# fitting a simple Cox model with Glucose_diff using coxph
cox.Glucose_diff <- coxph( Surv(Days, Status) ~ Glucose_diff, data=icu_patients_df1)
summary(cox.Glucose_diff) # significant
## Call:
## coxph(formula = Surv(Days, Status) ~ Glucose_diff, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Glucose_diff 0.0025905 1.0025939 0.0007899 3.28 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Glucose_diff 1.003 0.9974 1.001 1.004
##
## Concordance= 0.574 (se = 0.023 )
## Likelihood ratio test= 7.91 on 1 df, p=0.005
## Wald test = 10.76 on 1 df, p=0.001
## Score (logrank) test = 11.21 on 1 df, p=8e-04
# fitting a simple Cox model with SAPS1 using coxph
cox.SAPS1 <- coxph( Surv(Days, Status) ~ SAPS1, data=icu_patients_df1)
summary(cox.SAPS1) # significant
## Call:
## coxph(formula = Surv(Days, Status) ~ SAPS1, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SAPS1 0.03439 1.03499 0.01450 2.372 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## SAPS1 1.035 0.9662 1.006 1.065
##
## Concordance= 0.578 (se = 0.023 )
## Likelihood ratio test= 5.67 on 1 df, p=0.02
## Wald test = 5.63 on 1 df, p=0.02
## Score (logrank) test = 5.62 on 1 df, p=0.02
# fitting a simple Cox model with SOFA using coxph
cox.SOFA <- coxph( Surv(Days, Status) ~ SOFA, data=icu_patients_df1)
summary(cox.SOFA) # significant
## Call:
## coxph(formula = Surv(Days, Status) ~ SOFA, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SOFA 0.05041 1.05170 0.01903 2.649 0.00808 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## SOFA 1.052 0.9508 1.013 1.092
##
## Concordance= 0.59 (se = 0.024 )
## Likelihood ratio test= 7.1 on 1 df, p=0.008
## Wald test = 7.02 on 1 df, p=0.008
## Score (logrank) test = 7.04 on 1 df, p=0.008
# fitting a simple Cox model with HR_max using coxph
cox.HR_max <- coxph( Surv(Days, Status) ~ HR_max, data=icu_patients_df1)
summary(cox.HR_max) # significant
## Call:
## coxph(formula = Surv(Days, Status) ~ HR_max, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## HR_max 0.005178 1.005191 0.002198 2.355 0.0185 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## HR_max 1.005 0.9948 1.001 1.01
##
## Concordance= 0.563 (se = 0.024 )
## Likelihood ratio test= 4.73 on 1 df, p=0.03
## Wald test = 5.55 on 1 df, p=0.02
## Score (logrank) test = 5.5 on 1 df, p=0.02
# fitting a simple Cox model with RespRate_max using coxph
cox.RespRate_max <- coxph( Surv(Days, Status) ~ RespRate_max, data=icu_patients_df1)
summary(cox.RespRate_max)
## Call:
## coxph(formula = Surv(Days, Status) ~ RespRate_max, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## RespRate_max 0.01219 1.01227 0.00822 1.483 0.138
##
## exp(coef) exp(-coef) lower .95 upper .95
## RespRate_max 1.012 0.9879 0.9961 1.029
##
## Concordance= 0.539 (se = 0.027 )
## Likelihood ratio test= 2.16 on 1 df, p=0.1
## Wald test = 2.2 on 1 df, p=0.1
## Score (logrank) test = 2.21 on 1 df, p=0.1
# fitting series of multivariate Cox proportional hazards models
cox.mv1 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max, data=icu_patients_df1)
summary(cox.mv1)
## Call:
## coxph(formula = Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max,
## data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SAPS1 0.01057 1.01063 0.01835 0.576 0.5645
## SOFA 0.03648 1.03715 0.02366 1.541 0.1232
## WBC_max 0.01505 1.01516 0.00728 2.067 0.0388 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## SAPS1 1.011 0.9895 0.9749 1.048
## SOFA 1.037 0.9642 0.9901 1.086
## WBC_max 1.015 0.9851 1.0008 1.030
##
## Concordance= 0.597 (se = 0.022 )
## Likelihood ratio test= 11.95 on 3 df, p=0.008
## Wald test = 12.63 on 3 df, p=0.006
## Score (logrank) test = 12.81 on 3 df, p=0.005
cox.mv2 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max, data=icu_patients_df1)
summary(cox.mv2)
## Call:
## coxph(formula = Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max +
## HR_max, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SAPS1 -0.001324 0.998677 0.019245 -0.069 0.9451
## SOFA 0.043489 1.044448 0.023215 1.873 0.0610 .
## WBC_max 0.016200 1.016332 0.007437 2.178 0.0294 *
## HR_max 0.005377 1.005391 0.002585 2.080 0.0375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## SAPS1 0.9987 1.0013 0.9617 1.037
## SOFA 1.0444 0.9574 0.9980 1.093
## WBC_max 1.0163 0.9839 1.0016 1.031
## HR_max 1.0054 0.9946 1.0003 1.010
##
## Concordance= 0.598 (se = 0.024 )
## Likelihood ratio test= 15.83 on 4 df, p=0.003
## Wald test = 16.38 on 4 df, p=0.003
## Score (logrank) test = 16.55 on 4 df, p=0.002
cox.mv3 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max, data=icu_patients_df1)
summary(cox.mv3)
## Call:
## coxph(formula = Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max +
## HR_max + Glucose_max, data = icu_patients_df1)
##
## n= 190, number of events= 190
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SAPS1 -0.0085564 0.9914801 0.0194297 -0.440 0.6597
## SOFA 0.0436689 1.0446364 0.0233888 1.867 0.0619 .
## WBC_max 0.0177555 1.0179140 0.0076591 2.318 0.0204 *
## HR_max 0.0048966 1.0049086 0.0025927 1.889 0.0589 .
## Glucose_max 0.0017174 1.0017188 0.0007929 2.166 0.0303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## SAPS1 0.9915 1.0086 0.9544 1.030
## SOFA 1.0446 0.9573 0.9978 1.094
## WBC_max 1.0179 0.9824 1.0027 1.033
## HR_max 1.0049 0.9951 0.9998 1.010
## Glucose_max 1.0017 0.9983 1.0002 1.003
##
## Concordance= 0.614 (se = 0.023 )
## Likelihood ratio test= 19.84 on 5 df, p=0.001
## Wald test = 21.37 on 5 df, p=7e-04
## Score (logrank) test = 21.75 on 5 df, p=6e-04
# compare models
print(anova(cox.mv1, cox.mv2))
## Analysis of Deviance Table
## Cox model: response is Surv(Days, Status)
## Model 1: ~ SAPS1 + SOFA + WBC_max
## Model 2: ~ SAPS1 + SOFA + WBC_max + HR_max
## loglik Chisq Df P(>|Chi|)
## 1 -804.50
## 2 -802.56 3.8874 1 0.04865 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(anova(cox.mv2, cox.mv3))
## Analysis of Deviance Table
## Cox model: response is Surv(Days, Status)
## Model 1: ~ SAPS1 + SOFA + WBC_max + HR_max
## Model 2: ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max
## loglik Chisq Df P(>|Chi|)
## 1 -802.56
## 2 -800.56 4.0039 1 0.04539 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From comparing models with the anova() function, we found that cox.mv2 has significant p-value 0.049, thus we choose cox.mv2. Hence make further model comparisons to be sure all these variables should be retained using drop1()
cox.mv2 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max , data=icu_patients_df1)
print(drop1(cox.mv2, test="Chisq"))
## Single term deletions
##
## Model:
## Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max
## Df AIC LRT Pr(>Chi)
## <none> 1611.1
## SAPS1 1 1609.3 0.1938 0.65978
## SOFA 1 1612.6 3.4845 0.06194 .
## WBC_max 1 1613.9 4.8150 0.02821 *
## HR_max 1 1612.4 3.2384 0.07193 .
## Glucose_max 1 1613.1 4.0039 0.04539 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Remove SAPS1 and SOFA from the model after checking drop1() result
cox.refit <- coxph(Surv(Days, Status) ~ WBC_max + Glucose_max, data=icu_patients_df1)
print(drop1(cox.refit, test="Chisq"))
## Single term deletions
##
## Model:
## Surv(Days, Status) ~ WBC_max + Glucose_max
## Df AIC LRT Pr(>Chi)
## <none> 1612.0
## WBC_max 1 1616.5 6.4378 0.01117 *
## Glucose_max 1 1616.5 6.4915 0.01084 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The outputs from univariate cox show that the demographic variables like age and gender are not statistically significant, thus, we would not keep it in the models. And there is no significance regrading ICUType is also different with our initial expected.
Compared to the cox.refit model, cox.mv3 has a lower AIC value, so we prefer the cox.mv2 model.
Reminder: don’t forget to save this file, to knit it to check that everything works, and then submit via the drop box in OpenLearning.
When you have finished, and are satisfied with your assignment solutions, and this file knits without errors and the output looks the way you want, then you should submit via the drop box in OpenLearning.
If you encounter problems with any part of the process described above, please contact the course convenor via OpenLearning as soon as possible so that the issues can be resolved in good time, and well before the assignment is due.
Each task attracts the indicated number of marks (out of a total of 30 marks for the assignment). The instructions are deliberately open-ended and less prescriptive than the individual assignments to allow you some latitude in what you do and how you go about the task. However, to complete the tasks and gain full marks, you only need to replicate or repeat the steps covered in the course - if you do most or all of the things described in the revalant chapters of the HDAT9600 course, full marks will be awarded.
Note also that with respect to the model fitting, there are no right or wrong answers when it comes to variable selection and other aspects of model specification. Deep understanding of the underlying medical concepts which govern patient treatment and outcomes in ICUs is not required or assumed, although you should try to gain some understanding of each variable using the links provided. You will not be marked down if your medical justifications are not exactly correct or complete, but do you best, and don’t hesitate to seek help from the course convenor.